Document Classification with LSA and Pretopology
Abstract
Latent semantic analysis (LSA) is a computational method that models a major component of language learning and use. In this sense it is a theory of meaning: it applies to, and offers an explanation of, phenomena of meaning in words and passages. This gives LSA a strong position in automated document classification, document analysis, and related tasks. Although experiments show that LSA can reach very high accuracy in document classification, its performance depends on factors such as the quality and amount of training documents, the characteristics of the representative vectors, and the composition of the documents to be classified. Pretopology, for its part, has shown its strength in data classification and modeling, and some applications that strengthen pretopology with visualization in the classification domain have shown promising results. In this paper, two document classification algorithms based on pretopology and LSA are proposed, each suited to different situations, and their results on the DEFT07 challenge data are discussed. This work also outlines the future possibility of integrating visualization, which could support human intervention in the classification process.
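As a rough illustration of the LSA side of the abstract, the sketch below classifies a short text by projecting it into a low-rank latent space and comparing it with per-class centroids. It is not the paper's algorithm: the toy corpus, the labels, all function names, the raw term counts, and the nearest-centroid rule with cosine similarity are assumptions made for this example, and the pretopology component is not covered.

```python
import numpy as np

# Hypothetical toy corpus with made-up labels (this is not the DEFT07 data).
TRAIN = [
    ("the team won the football match", "sports"),
    ("the player scored a goal in the match", "sports"),
    ("the parliament voted on the new law", "politics"),
    ("the minister proposed a law to parliament", "politics"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple assumption)."""
    return text.lower().split()


def term_document_matrix(texts):
    """Raw term counts: one row per vocabulary word, one column per document."""
    vocab = sorted({w for t in texts for w in tokenize(t)})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(texts)))
    for j, t in enumerate(texts):
        for w in tokenize(t):
            A[index[w], j] += 1.0
    return A, index


def project(text, index, U_k):
    """Map a text into the k-dimensional latent space; unknown words are ignored."""
    q = np.zeros(U_k.shape[0])
    for w in tokenize(text):
        if w in index:
            q[index[w]] += 1.0
    return U_k.T @ q


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def train(pairs, k=2):
    """Fit a rank-k LSA space and compute one centroid per class."""
    texts = [t for t, _ in pairs]
    labels = [lab for _, lab in pairs]
    A, index = term_document_matrix(texts)
    # Truncated SVD: keep the k leading left singular vectors.
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]
    doc_vectors = (U_k.T @ A).T  # shape (n_docs, k)
    centroids = {}
    for lab in sorted(set(labels)):
        rows = [i for i, l in enumerate(labels) if l == lab]
        centroids[lab] = doc_vectors[rows].mean(axis=0)
    return index, U_k, centroids


def classify(text, model):
    """Return the label whose centroid is most cosine-similar to the text."""
    index, U_k, centroids = model
    q = project(text, index, U_k)
    return max(centroids, key=lambda lab: cosine(q, centroids[lab]))


MODEL = train(TRAIN, k=2)
print(classify("goal scored by the football team", MODEL))
```

The rank `k` and the choice of term weighting are the kind of "representative vector" characteristics the abstract says accuracy depends on; a realistic setup would likely use a weighting scheme such as TF-IDF and a far larger training set.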
Related resources
Influence of domain information on Latent Semantic Analysis of Hindi text
The work presented in this paper evaluates the performance of the Latent Semantic Analysis (LSA) model in capturing word correlations within text by including domain information in the process. The performance of the model is empirically evaluated by classification of Hindi text. The classification accuracies are compared against plain LSA. An increase of 1.25% in classification accuracy is ac...
Capturing the semantic structure of documents using summaries in Supplemented Latent Semantic Analysis
Latent Semantic Analysis (LSA) is a mathematical technique that is used to capture the semantic structure of documents based on correlations among textual elements within them. Summaries of documents contain words that actually contribute towards the concepts of documents. In the present work, summaries are used in LSA along with supplementary information such as document category and domain in...
Document representation with Generalized Latent Semantic Analysis
Methods for dimensionality reduction, notably LSA, have been successfully applied to the information retrieval task and document classification. Recently, corpus-based association measures such as point-wise mutual information have been found to outperform LSA on a variety of tasks. We have developed an algorithmic framework that computes a low-dimensional vector space representation of documen...
A New Document Embedding Method for News Classification
Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A lot of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...
Latent semantic sentence clustering for multi-document summarization
This thesis investigates the applicability of Latent Semantic Analysis (LSA) to sentence clustering for Multi-Document Summarization (MDS). In contrast to more shallow approaches like measuring similarity of sentences by word overlap in a traditional vector space model, LSA takes word usage patterns into account. So far LSA has been successfully applied to different Information Retrieval (IR) t...
Journal: Stud. Inform. Univ. (Studia Informatica Universalis)
Volume: 8
Publication date: 2010